Improvements to Korektor: A Case Study with Native and Non-Native Czech
Abstract
We present recent developments of Korektor, a statistical spell checking system. In addition to a lexicon, Korektor uses language models to find real-word errors, which are detectable only in context. The models and error probabilities, learned from error corpora, are also used to suggest the most likely corrections. Korektor was originally trained on a small error corpus and used language models extracted from an in-house corpus, WebColl. We show two recent improvements:
• We built new language models from freely available (shuffled) versions of the Czech National Corpus and show that these perform consistently better on texts produced both by native speakers and by non-native learners of Czech.
• We trained new error models on a manually annotated learner corpus and show that they perform better than the standard error model (in error detection) not only for the learners’ texts, but also for our standard evaluation data of native Czech. For error correction, the standard error model outperformed the non-native models in 2 out of 3 test datasets. We discuss reasons for this not-quite-intuitive improvement.
Based on these findings and on an analysis of errors in both native and learners’ Czech, we propose directions for further improvements of Korektor.
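As a rough illustration of how a language model and an error model can be combined to catch a real-word error, the following is a minimal sketch and not Korektor's actual implementation. The bigram counts, the English toy vocabulary, the `KEEP_PROB` and `EDIT_PROB` values, and all function names are assumptions made for this example only; the decoding is a simple greedy left-to-right pass.

```python
import math

# Hypothetical toy counts (English for readability); not taken from any Korektor corpus.
UNIGRAMS = {"better": 50, "than": 40, "then": 40, "before": 30}
BIGRAMS = {("better", "than"): 40, ("than", "before"): 20, ("then", "before"): 1}
VOCAB = set(UNIGRAMS)

KEEP_PROB = 0.95  # assumed probability that a typed word is what the writer intended
EDIT_PROB = 0.01  # assumed probability of each single-character edit


def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def lm_logprob(prev_word, word):
    """Add-one smoothed bigram log-probability: the language model."""
    num = BIGRAMS.get((prev_word, word), 0) + 1
    den = UNIGRAMS.get(prev_word, 0) + len(VOCAB)
    return math.log(num / den)


def error_logprob(typed, intended):
    """Very crude error model: a fixed cost per character edit."""
    d = edit_distance(typed, intended)
    return math.log(KEEP_PROB) if d == 0 else d * math.log(EDIT_PROB)


def candidates(word, max_dist=1):
    """Lexicon words within max_dist edits, always including the typed word."""
    found = [w for w in sorted(VOCAB) if edit_distance(word, w) <= max_dist]
    return found if word in found else [word] + found


def correct(tokens):
    """Greedy pass: pick the candidate that best fits its left and right neighbours."""
    out = []
    for i, typed in enumerate(tokens):
        left = out[-1] if out else None
        right = tokens[i + 1] if i + 1 < len(tokens) else None

        def score(cand):
            s = error_logprob(typed, cand)
            if left is not None:
                s += lm_logprob(left, cand)
            if right is not None:
                s += lm_logprob(cand, right)
            return s

        out.append(max(candidates(typed), key=score))
    return out


print(correct(["better", "then", "before"]))  # ['better', 'than', 'before']
```

Here "then" is a valid lexicon word, so a lexicon-only checker would accept it; only the context scores make "than" the preferred reading. Without the left context "better", the same sketch leaves "then before" unchanged, because the language-model gain no longer outweighs the assumed edit cost.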
Similar resources
Comparing Lexical Bundles in Hard Science Lectures; A Case of Native and Non-Native University Lecturers
Researchers stated that learning and applying a certain set of lexical bundles of native lecturers by non-native lecturers would help students improve their proficiency through incidental vocabulary input. The present study shed light on the lexical bundles in hard science lectures used by Native and Non-native lecturers in international universities with the main purpose of analyzing the structu...
The Use of Lexical Bundles in Native and Non-native Post-graduate Writing: The Case of Applied Linguistics MA Theses
Connor et al. (2008) mention “specifying textual requirements of genres” (p.12) as one of the reasons which have motivated researchers in the analysis of writing. Members of each genre should be able to produce and retrieve these textual requirements appropriately to be considered communicatively proficient. One of the textual requirements of genres is regularities of specific forms and content...
Language Learning and Language Teaching: Episodes of the Lives of Six EFL Teachers in Iran
Teachers are the most important players of every educational system in different societies; accordingly, understanding their personal reflections may help us gain valuable insights into what it means to be a teacher in a specific cultural and social context. The purpose of this case study was to investigate the life and career of 6 non-native English speaking teachers in state educational syste...
The Discursive Construction of “Native” and “Non-Native” Speaker English Teacher Identities in Japan: A Linguistic Ethnographic Investigation
Recent poststructuralist theories of identity posit identities as being discursively constructed in interactions with society, institutions, and individuals. This study used a Linguistic Ethnographic framework to investigate the discursive identity construction of two English teachers, one ‘non-native’ English speaker, and one ‘native’ English speaker, teaching English in a tertiary institution...
Korektor - A System for Contextual Spell-Checking and Diacritics Completion
We present Korektor – a flexible and powerful purely statistical text correction tool for Czech that goes beyond a traditional spell checker. We use a combination of several language models and an error model to offer the best ordering of correction proposals and also to find errors that cannot be detected by simple spell checkers, namely spelling errors that happen to be homographs of existing...